Instructions and Expectations

Predicting voter behavior is complicated for many reasons despite the tremendous effort in collecting, analyzing, and understanding many available datasets. For our final project, we will analyze the 2016 presidential election dataset.

Background

The presidential election in 2012 did not come as a surprise. Some correctly predicted the outcome of the election correctly including Nate Silver, and many speculated his approach.

Despite the success in 2012, the 2016 presidential election came as a big surprise to many, and it was a clear example that even the current state-of-the-art technology can surprise us.

Answer the following questions in one paragraph for each.

1. What makes voter behavior prediction (and thus election forecasting) a hard problem?

The data that we have can only be from the poll before the election actually happens. A lot of times, voters’ opinions will change as the election gets closer. This could be due to a particularly successful campaign ad or a sudden scandal of a candidate being exposed days before election. These random variables are difficult to measure, therefore, up until the election day, the polls generated are never 100 precent accurate. Also, sampling errors could orrcur when researchers accidentally poll more supporters of a candidate than are represented in the general population. And the voters might not answer truthfully. For example, if someone feels like he/she might get attacked by others around them if they reveal their true political opinion, then they might not speak their mind until the day of the election. On top of that, not everyone who answered the suyvey will turn out to vote.

2. What was unique to Nate Silver’s approach in 2012 that allowed him to achieve good predictions?

Instead of looking at the maximum probability, Nate Silver looked at the full range of probabilities. For each date, he proposed a prior belief of how much a candidate would get the new day. The next day he would get the new data containing the actual probability of the support, and then updaet his model after observing the actual data.

3. What went wrong in 2016? What do you think should be done to make future predictions better?

The biggest issues with 2016 election prediction were: a large change in voters’ preferences during the final days before the election, overrepresentation of college graduates did not get adjusted and many of the Trump supporters did not speak their mind during the poll until the day of election. One of the ways to improve predicting results is to ask the voters about their past voting behavior. We can combine voter’s self prediction and their past history to determine whether or not they will actually vote in the end.

Data

## read data and convert candidate from string to factor
election.raw <- read_delim("data/election/election.csv", delim = ",") %>% mutate(candidate=as.factor(candidate))

census_meta <- read_delim("data/census/metadata.csv", delim = ";", col_names = FALSE) 
census <- read_delim("data/census/census.csv", delim = ",") 

Election data

The meaning of each column in election.raw is clear except fips. The accronym is short for Federal Information Processing Standard.

In our dataset, fips values denote the area (US, state, or county) that each row of data represent. For example, fips value of 6037 denotes Los Angeles County.

kable(election.raw %>% filter(county == "Los Angeles County"))  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
county fips candidate state votes
Los Angeles County 6037 Hillary Clinton CA 2464364
Los Angeles County 6037 Donald Trump CA 769743
Los Angeles County 6037 Gary Johnson CA 88968
Los Angeles County 6037 Jill Stein CA 76465
Los Angeles County 6037 Gloria La Riva CA 21993

Some rows in election.raw are summary rows and these rows have county value of NA. There are two kinds of summary rows:

  • Federal-level summary rows have fips value of US.
  • State-level summary rows have names of each states as fips value.

4. Report the dimension of election.raw after removing rows with fips=2000. Provide a reason for excluding them. Please make sure to use the same name election.raw before and after removing those observations.

# removing rows with fips=2000
election.raw %>% filter(fips != 2000)
## # A tibble: 18,345 x 5
##    county fips  candidate                   state    votes
##    <chr>  <chr> <fct>                       <chr>    <int>
##  1 <NA>   US    Donald Trump                US    62984825
##  2 <NA>   US    Hillary Clinton             US    65853516
##  3 <NA>   US    Gary Johnson                US     4489221
##  4 <NA>   US    Jill Stein                  US     1429596
##  5 <NA>   US    Evan McMullin               US      510002
##  6 <NA>   US    Darrell Castle              US      186545
##  7 <NA>   US    Gloria La Riva              US       74117
##  8 <NA>   US    Rocky De La Fuente          US       33010
##  9 <NA>   US    " None of these candidates" US       28863
## 10 <NA>   US    Richard Duncan              US       24235
## # ... with 18,335 more rows

The dimension for election.row is now 18,345 x 5. We filtered out fips = 2000 because numerical values of fips represent county-level data. And we have missing values in the county column for fips = 2000.

Census data

Following is the first few rows of the census data:

kable(census %>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
CensusTract State County TotalPop Men Women Hispanic White Black Native Asian Pacific Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Construction Production Drive Carpool Transit Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment
1001020100 Alabama Autauga 1948 940 1008 0.9 87.4 7.7 0.3 0.6 0.0 1503 61838 11900 25713 4548 8.1 8.4 34.7 17.0 21.3 11.9 15.2 90.2 4.8 0 0.5 2.3 2.1 25.0 943 77.1 18.3 4.6 0 5.4
1001020200 Alabama Autauga 2156 1059 1097 0.8 40.4 53.3 0.0 2.3 0.0 1662 32303 13538 18021 2474 25.5 40.3 22.3 24.7 21.5 9.4 22.0 86.3 13.1 0 0.0 0.7 0.0 23.4 753 77.0 16.9 6.1 0 13.3
1001020300 Alabama Autauga 2968 1364 1604 0.0 74.5 18.6 0.5 1.4 0.3 2335 44922 5629 20689 2817 12.7 19.7 31.4 24.9 22.1 9.2 12.4 94.8 2.8 0 0.0 0.0 2.5 19.6 1373 64.1 23.6 12.3 0 6.2
1001020400 Alabama Autauga 4423 2172 2251 10.5 82.8 3.7 1.6 0.0 0.0 3306 54329 7003 24125 2870 2.1 1.6 27.0 20.8 27.0 8.7 16.4 86.6 9.1 0 0.0 2.6 1.6 25.3 1782 75.7 21.2 3.1 0 10.8
1001020500 Alabama Autauga 10763 4922 5841 0.7 68.5 24.8 0.0 3.8 0.0 7666 51965 6935 27526 2813 11.4 17.5 49.6 14.2 18.2 2.1 15.8 88.0 10.5 0 0.0 0.6 0.9 24.8 5037 67.1 27.6 5.3 0 4.2
1001020600 Alabama Autauga 3851 1787 2064 13.1 72.9 11.9 0.0 0.0 0.0 2642 63092 9585 30480 7550 14.4 21.9 24.2 17.5 35.4 7.9 14.9 82.7 6.9 0 0.0 6.0 4.5 19.8 1560 79.4 14.7 5.8 0 10.9

Census data: column metadata

Column information is given in metadata.

Data wrangling

5. Remove summary rows from election.raw data: i.e.,

  • Federal-level summary into a election_federal.

  • State-level summary into a election_state.

  • Only county-level data is to be in election.

election_federal <- election.raw %>% filter(fips == "US")
election_state <- election.raw %>% filter(fips == state & fips != "US")
election <- election.raw %>% filter(!is.na(county))

6. How many named presidential candidates were there in the 2016 election? Draw a bar chart of all votes received by each candidate. You can split this into multiple plots or may prefer to plot the results on a log scale. Either way, the results should be clear and legible!

# list of candidates 
kable(election_federal$candidate %>% head, "html", caption = "List of Candidates")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
# count the number of candidates
# length(unique(election_federal$candidate))

election_federal %>% 
  filter(candidate !=  " None of these candidates") %>% 
  ggplot(aes(candidate, votes)) + 
  scale_y_log10() +
  geom_bar(stat = "identity") +
  ggtitle("Votes received by each candidate on a log scale") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), plot.title = element_text(hjust = 0.5))  

There are 31 named presidential candidates in the 2016 election

7. Create variables county_winner and state_winner by taking the candidate with the highest proportion of votes. Hint: to create county_winner, start with election, group by fips, compute total votes, and pct = votes/total. Then choose the highest row using top_n (variable state_winner is similar).

state_winner <- election_state %>% group_by(fips, add = TRUE) %>% mutate(total = sum(votes)) %>% mutate(pct = votes/total) %>% top_n(1, pct)
county_winner <- election %>% group_by(fips, add = TRUE) %>% mutate(total = sum(votes)) %>% mutate(pct = votes/total) %>% top_n(1, pct)

Visualization

Visualization is crucial for gaining insight and intuition during data mining. We will map our data onto maps.

The R package ggplot2 can be used to draw maps. Consider the following code.

states <- map_data("state")

ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") + 
  coord_fixed(1.3) +
  guides(fill=FALSE)  # color legend is unnecessary and takes too long

The variable states contain information to draw white polygons, and fill-colors are determined by region.

8. Draw county-level map by creating counties = map_data("county"). Color by county

county <- map_data("county")

ggplot(data = county) + 
  geom_polygon(aes(x = long, y = lat, fill = subregion, group = group), color = "white") + 
  coord_fixed(1.3) +
  guides(fill=FALSE)  # color legend is unnecessary and takes too long

9. Now color the map by the winning candidate for each state. First, combine states variable and state_winner we created earlier using left_join(). Note that left_join() needs to match up values of states to join the tables. A call to left_join() takes all the values from the first table and looks for matches in the second table. If it finds a match, it adds the data from the second table; if not, it adds missing values:

Here, we’ll be combing the two datasets based on state name. However, the state names are in different formats in the two tables: e.g. AZ vs. arizona. Before using left_join(), create a common column by creating a new column for states named fips = state.abb[match(some_column, some_function(state.name))]. Replace some_column and some_function to complete creation of this new column. Then left_join(). Your figure will look similar to state_level New York Times map.

states <- states %>% mutate(fips = state.abb[match(region, tolower(state.name))])
states <- left_join(states, state_winner)

# color the map by the winning candidate for each state
ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") + 
  coord_fixed(1.3)

10. The variable county does not have fips column. So we will create one by pooling information from maps::county.fips. Split the polyname column to region and subregion. Use left_join() combine county.fips into county. Also, left_join() previously created variable county_winner. Your figure will look similar to county-level New York Times map.

county.fips <- maps::county.fips 
county.fips <- county.fips %>% separate(polyname, c("region", "subregion"), sep = ",")

county <- left_join(county, county.fips)
county_winner$fips <- as.numeric(county_winner$fips)
county <- left_join(county, county_winner)

# color the map by the winning candidate for each county
ggplot(data = county) + 
  geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") + 
  coord_fixed(1.3)

11. Create a visualization of your choice using census data. Many exit polls noted that demographics played a big role in the election. Use this Washington Post article and this R graph gallery for ideas and inspiration.

census.explore <- census %>% 
  filter(complete.cases(census[setdiff(names(census), 'CensusTract')])) %>% 
  mutate(Hispanic.num = Hispanic/100 * TotalPop,
             White.num = White/100 * TotalPop, 
           Black.num = Black/100 * TotalPop, 
           Native.num = Native/100 * TotalPop, 
           Asian.num = Asian/100 * TotalPop, 
           Pacific.num = Pacific/100 *TotalPop) %>% 
  select(CensusTract:Women, Hispanic.num:Pacific.num) %>%
  group_by(State) %>%
  summarize_at(vars(TotalPop:Pacific.num), funs(sum(.))) %>% 
  gather(key = 'race_population', value = 'population', Hispanic.num:Pacific.num) %>% 
  separate(race_population, c("race", "num")) %>% 
  select(-num)

ggplot(census.explore, aes(State, population, fill = race)) + 
  geom_bar(position = "identity", stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Population in each state by race")

12. The census data contains high resolution information (more fine-grained than county-level). In this problem, we aggregate the information into county-level data by computing TotalPop-weighted average of each attributes for each county. Create the following variables:

census.del <- census %>% 
  filter(complete.cases(census[setdiff(names(census), 'CensusTract')])) %>% 
  mutate(Men = Men / TotalPop * 100, 
         Employed = Employed / TotalPop * 100, 
         Citizen = Citizen / TotalPop * 100) %>% 
  mutate(Minority = Hispanic + Black + Native + Asian + Pacific) %>% 
  select(-c(Women, Hispanic, Black, Native, Asian, Pacific, Walk, PublicWork, Construction))
census.subct <- census.del %>% 
  group_by(State, County, add = TRUE) %>% 
  add_tally(TotalPop) %>% 
  mutate(CountyTotal = n) %>%
  select(-n) %>% 
  mutate(weight = TotalPop / CountyTotal)
census.ct <- census.subct %>% 
  summarize_at(vars(Men:Minority), funs(sum(. * weight)))
kable(census.ct[c(100:105, 1000:1005), ] %>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
State County Men White Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Production Drive Carpool Transit OtherTransp WorkAtHome MeanCommute Employed PrivateWork SelfEmployed FamilyWork Unemployment Minority
Arizona Gila 49.72444 63.67513 77.48519 38128.94 7742.802 20583.43 3187.475 22.64903 33.19307 29.21547 23.52462 22.07370 9.974047 76.89079 11.73858 0.4016853 2.3503640 5.121422 21.53425 33.78350 70.47708 4.664701 0.1417869 13.202261 34.89076
Arizona Graham 53.59692 51.69957 69.02451 45252.63 6686.589 17481.32 2676.082 21.57643 27.19014 29.99648 21.39130 20.94082 12.228340 76.34010 12.87230 1.6948459 2.3358275 3.910525 21.80545 31.95659 70.23631 4.140407 0.1225840 13.688962 47.37405
Arizona Greenlee 52.62108 47.24315 68.55813 50250.27 8026.202 21994.42 2603.522 13.62131 19.44737 25.96103 13.10784 13.88864 15.038524 79.39181 13.07987 0.0000000 2.0490081 3.328073 19.68985 37.08301 80.31936 4.778854 0.0000000 10.592741 51.65261
Arizona La Paz 51.71903 50.87192 68.81057 34854.57 5399.670 19496.09 3259.001 21.12987 32.03473 22.07872 20.15581 22.40299 10.369465 78.02055 11.51815 0.5578606 0.8026418 5.226134 12.36893 35.18919 59.01420 9.368088 0.0000000 10.631233 47.64221
Arizona Maricopa 49.47870 56.64921 65.53484 60542.26 9669.632 27761.68 3879.826 17.26386 22.76252 35.17442 18.92922 26.92891 10.023129 76.14856 11.37012 2.4650033 2.6253456 5.772809 25.68234 46.05740 82.44120 5.762338 0.1545059 7.945101 41.15112
Arizona Mohave 50.33930 78.21948 76.97800 39249.58 7123.632 20973.73 3353.569 19.84279 28.50158 24.53038 24.93882 27.41707 12.484930 77.40242 14.63040 0.4397705 2.4444572 3.068395 20.01382 32.88717 78.19054 6.494262 0.3120396 13.113418 19.59276

Dimensionality reduction

13. Run PCA for both county & sub-county level data. Save the first two principle components PC1 and PC2 into a two-column data frame, call it ct.pc and subct.pc, respectively. Discuss whether you chose to center and scale the features before running PCA and the reasons for your choice. What are the three features with the largest absolute values of the first principal component? Which features have opposite signs and what does that mean about the correaltion between these features?

# County level 
pr.out.ct <- prcomp(census.ct[, -c(1:2)], scale = TRUE, center = TRUE)
ct.pc <- pr.out.ct$x[, 1:2]
# Highest absolute loadings for PC1
pc1.vector.ct <- abs(pr.out.ct$rotation[, 1])
pc1.vector.ct <- sort(pc1.vector.ct, decreasing = TRUE) 
kable(head(pc1.vector.ct)%>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
x
IncomePerCap 0.3515553
ChildPoverty 0.3436011
Poverty 0.3423083
Employed 0.3278373
Income 0.3200339
Unemployment 0.2907011
# Look for negative correlation in PC1
kable(pr.out.ct$rotation[, 1], caption = "Negative correlations in PC1, county level"%>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
Negative correlations in PC1, county level
x
Men 0.0068248
White 0.2230261
Citizen 0.0046889
Income 0.3200339
IncomeErr 0.1698692
IncomePerCap 0.3515553
IncomePerCapErr 0.1945209
Poverty -0.3423083
ChildPoverty -0.3436011
Professional 0.2501644
Service -0.1817970
Office -0.0150122
Production -0.1187617
Drive -0.0941528
Carpool -0.0767026
Transit 0.0707426
OtherTransp -0.0096675
WorkAtHome 0.1748976
MeanCommute -0.0586108
Employed 0.3278373
PrivateWork 0.0564638
SelfEmployed 0.0977609
FamilyWork 0.0486661
Unemployment -0.2907011
Minority -0.2265319
# Subcounty level 
pr.out.subct <- prcomp(census.subct[, -c(1:3)], scale = TRUE, center = TRUE)
subct.pc <- pr.out.subct$x[, 1:2]
# Highest absolute loadings for PC1
pc1.vector.subct <- abs(pr.out.subct$rotation[, 1])
pc1.vector.subct <- sort(pc1.vector.subct, decreasing = TRUE) 
kable(head(pc1.vector.subct)%>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
x
IncomePerCap 0.3181191
Professional 0.3064354
Poverty 0.3046909
Income 0.3025096
ChildPoverty 0.2978868
Service 0.2688339
# Look for negative correlation in PC1
kable(pr.out.subct$rotation[, 1], caption = "Negative correlations in PC1, subcounty level"%>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
Negative correlations in PC1, subcounty level
x
TotalPop -0.0324186
Men -0.0172915
White -0.2404300
Citizen -0.1608628
Income -0.3025096
IncomeErr -0.1989363
IncomePerCap -0.3181191
IncomePerCapErr -0.2123317
Poverty 0.3046909
ChildPoverty 0.2978868
Professional -0.3064354
Service 0.2688339
Office 0.0138010
Production 0.2068050
Drive -0.0789278
Carpool 0.1625664
Transit 0.0573112
OtherTransp 0.0451425
WorkAtHome -0.1729726
MeanCommute -0.0099985
Employed -0.2212063
PrivateWork 0.0420216
SelfEmployed -0.0697035
FamilyWork -0.0152021
Unemployment 0.2528185
Minority 0.2420290
CountyTotal 0.0215341
weight 0.0119266

I choose to center and scale the features before I run PCA. In general, it is a good idea to scale the variables. Otherwise the magnitude to certain variables might dominates the associations between the variables. In our data, the variables are not recorded in the same scale, some of them are in percentage and some of them are not.

The three features with the largest absolute values of the first principal component for county level data are: IncomePerCap, ChildPoverty and Poverty. The three features with the largest absolute values of the first principal component for sub-county level data are: IncomePerCap, Professional and Poverty.

Negative loadings means some negative correlations among variables. Features have opposite signs in PC1 of county level data are Poverty, ChildPoverty, Service, Office, Production, Drive, Carpool, OtherTransp, MeanCommute, UNemployment and Minority. Features have opposite signs in PC1 of sub-county level data are just the oppsite of county level data. This just means that variables like Poverty, ChildPoverty are negatively correlated to variables like Employed, Professional and Income. Another interesting negative correlation worth pointing out is Minority. Larger minority group corresponds with less income / less employed etc. in the area.

14. Determine the number of minimum number of PCs needed to capture 90% of the variance for both the county and sub-county analyses. Plot proportion of variance explained (PVE) and cumulative PVE for both county and sub-county analyses.

# County level 
pr.var.ct <- pr.out.ct$sdev ^ 2
pve.ct <- pr.var.ct/sum(pr.var.ct) 
cumulative_pve.ct <- cumsum(pve.ct)

# Number of minimum number of PCs needed to capture 90% of the variance
# min(which(cumulative_pve.ct >= 0.90))

## This will put the next two plots side by side 
par(mfrow=c(1, 2))

## Plot proportion of variance explained
plot(pve.ct, type="l", lwd=3, main = "PVE for County")
plot(cumulative_pve.ct, type="l", lwd=3, main = "Cumulative PVE for County")

#Sub-county level
pr.var.subct <- pr.out.subct$sdev ^ 2
pve.subct <- pr.var.subct/sum(pr.var.subct) 
cumulative_pve.subct <- cumsum(pve.subct)

## This will put the next two plots side by side 
par(mfrow=c(1, 2))

# Number of minimum number of PCs needed to capture 90% of the variance
# min(which(cumulative_pve.subct >= 0.90))

## Plot proportion of variance explained
plot(pve.subct, type="l", lwd=3, main = "PVE for Sub-County")
plot(cumulative_pve.subct, type="l", lwd=3, main = "Cumulative PVE for Sub-County")

Minimum number of PCs needed to capture 90% of the variance for the county is 13 and for the sub-county is 17.

Clustering

15. With census.ct, perform hierarchical clustering with complete linkage. Cut the tree to partition the observations into 10 clusters. Re-run the hierarchical clustering algorithm using the first 2 principal components from ct.pc as inputs instead of the originald features. Compare and contrast the results. For both approaches investigate the cluster that contains San Mateo County. Which approach seemed to put San Mateo County in a more appropriate clusters? Comment on what you observe and discuss possible explanations for these observations.

set.seed(1)

# hierarchical clustering for census.ct
census.ct.scaled <- as.data.frame(scale(census.ct[, -c(1:2)]))
dis.census.ct <- dist(census.ct.scaled, method="euclidean")
census.ct.hclust <- hclust(dis.census.ct, method = "complete")  
census.ct.hclust <- cutree(census.ct.hclust, k = 10)

# hierarchical clustering for ct.pc
ct.pc.scaled <- as.data.frame(scale(ct.pc))
dis.ct.pc <- dist(ct.pc.scaled, method="euclidean")
ct.pc.hclust <- hclust(dis.ct.pc, method = "complete")  
ct.pc.hclust <- cutree(ct.pc.hclust, k = 10)
kable(census.ct %>% filter(County == "San Mateo"), caption = "San Mateo County"%>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
San Mateo County
State County Men White Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Production Drive Carpool Transit OtherTransp WorkAtHome MeanCommute Employed PrivateWork SelfEmployed FamilyWork Unemployment Minority
California San Mateo 49.19773 40.63851 64.2005 100369.9 16123.02 47881.29 6115.552 8.011122 9.705514 45.73565 18.28979 22.304 7.343291 69.92713 10.68144 9.257082 2.598808 5.077957 26.82681 51.72497 79.76635 8.367532 0.1716192 6.689483 55.53405
# Samples within the cluster containing "San Mateo"
# census.ct.hclust[which(census.ct$County == "San Mateo")]
samples.census.ct <- census.ct[which(census.ct.hclust == 7), ]
kable(summary(samples.census.ct), caption = "Sample summary for cluster 7"%>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
Sample summary for cluster 7
State </th>
County </th>
  Men </th>
 White </th>
Citizen </th>
 Income </th>
IncomeErr IncomePerCap IncomePerCapErr
Poverty </th>
ChildPoverty Professional
Service </th>
 Office </th>
Production
 Drive </th>
Carpool </th>
Transit </th>
OtherTransp WorkAtHome MeanCommute
Employed </th>
PrivateWork SelfEmployed FamilyWork Unemployment
Minority </th>
Length:115 Length:115 Min. :46.77 Min. :28.17 Min. :58.75 Min. : 36373 Min. : 7151 Min. :22991 Min. :3117 Min. : 2.739 Min. : 0.000 Min. :26.51 Min. : 5.00 Min. :11.97 Min. : 1.519 Min. :36.14 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 1.400 Min. : 6.30 Min. :40.96 Min. :52.80 Min. : 2.917 Min. :0.00000 Min. : 1.400 Min. : 4.223
Class :character Class :character 1st Qu.:48.74 1st Qu.:60.90 1st Qu.:67.19 1st Qu.: 74500 1st Qu.:10803 1st Qu.:35306 1st Qu.:4245 1st Qu.: 5.997 1st Qu.: 7.105 1st Qu.:40.52 1st Qu.:13.40 1st Qu.:21.33 1st Qu.: 6.327 1st Qu.:68.65 1st Qu.: 7.145 1st Qu.: 1.188 1st Qu.: 1.002 1st Qu.: 4.580 1st Qu.:24.47 1st Qu.:48.94 1st Qu.:72.71 1st Qu.: 4.610 1st Qu.:0.09446 1st Qu.: 4.668 1st Qu.:13.998
Mode :character Mode :character Median :49.18 Median :72.84 Median :70.51 Median : 84360 Median :12273 Median :38039 Median :4784 Median : 7.581 Median : 9.686 Median :44.17 Median :15.29 Median :23.51 Median : 7.761 Median :77.85 Median : 7.979 Median : 3.510 Median : 1.349 Median : 5.388 Median :29.03 Median :51.21 Median :79.29 Median : 5.712 Median :0.13250 Median : 6.023 Median :24.117
NA NA Mean :49.65 Mean :69.73 Mean :70.35 Mean : 84333 Mean :12740 Mean :39527 Mean :5206 Mean : 8.781 Mean :10.490 Mean :45.11 Mean :15.80 Mean :22.73 Mean : 8.045 Mean :73.83 Mean : 8.254 Mean : 6.819 Mean : 1.828 Mean : 5.786 Mean :28.04 Mean :51.30 Mean :77.08 Mean : 6.439 Mean :0.14357 Mean : 6.254 Mean :27.763
NA NA 3rd Qu.:49.83 3rd Qu.:83.64 3rd Qu.:74.16 3rd Qu.: 93631 3rd Qu.:14238 3rd Qu.:43391 3rd Qu.:5627 3rd Qu.:10.833 3rd Qu.:12.059 3rd Qu.:50.07 3rd Qu.:17.87 3rd Qu.:24.97 3rd Qu.: 9.556 3rd Qu.:81.75 3rd Qu.: 9.190 3rd Qu.: 9.026 3rd Qu.: 2.048 3rd Qu.: 6.472 3rd Qu.:32.02 3rd Qu.:53.15 3rd Qu.:82.71 3rd Qu.: 6.667 3rd Qu.:0.16662 3rd Qu.: 7.567 3rd Qu.:36.628
NA NA Max. :60.06 Max. :93.99 Max. :79.25 Max. :128268 Max. :23381 Max. :65599 Max. :9699 Max. :26.606 Max. :32.357 Max. :74.22 Max. :25.95 Max. :26.44 Max. :16.100 Max. :91.30 Max. :16.000 Max. :39.424 Max. :10.238 Max. :15.900 Max. :42.70 Max. :64.08 Max. :86.55 Max. :22.737 Max. :0.80204 Max. :14.866 Max. :69.604
# ct.pc.hclust[which(census.ct$County == "San Mateo")]
samples.ct.cp <- census.ct[which(ct.pc.hclust == 4), ]
kable(summary(samples.ct.cp), caption = "Sample summary for cluster 4" %>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
Sample summary for cluster 4
State </th>
County </th>
  Men </th>
 White </th>
Citizen </th>
 Income </th>
IncomeErr IncomePerCap IncomePerCapErr
Poverty </th>
ChildPoverty Professional
Service </th>
 Office </th>
Production
 Drive </th>
Carpool </th>
Transit </th>
OtherTransp WorkAtHome MeanCommute
Employed </th>
PrivateWork SelfEmployed FamilyWork Unemployment
Minority </th>
Length:657 Length:657 Min. :45.84 Min. :14.30 Min. :53.72 Min. : 36373 Min. : 2934 Min. :21335 Min. :1540 Min. : 1.50 Min. : 0.00 Min. :20.32 Min. : 5.00 Min. :12.73 Min. : 2.40 Min. :32.13 Min. : 0.000 Min. : 0.0000 Min. : 0.0000 Min. : 1.262 Min. : 9.707 Min. :38.03 Min. :53.42 Min. : 2.608 Min. :0.00000 Min. : 0.900 Min. : 0.4824
Class :character Class :character 1st Qu.:48.95 1st Qu.:77.32 1st Qu.:72.15 1st Qu.: 54879 1st Qu.: 7006 1st Qu.:27695 1st Qu.:3038 1st Qu.: 8.20 1st Qu.: 9.99 1st Qu.:32.23 1st Qu.:14.90 1st Qu.:21.69 1st Qu.: 9.16 1st Qu.:77.07 1st Qu.: 7.815 1st Qu.: 0.2067 1st Qu.: 0.9023 1st Qu.: 3.708 1st Qu.:20.055 1st Qu.:47.71 1st Qu.:74.13 1st Qu.: 4.920 1st Qu.:0.09495 1st Qu.: 4.182 1st Qu.: 5.0390
Mode :character Mode :character Median :49.47 Median :87.82 Median :74.79 Median : 60656 Median : 8535 Median :29922 Median :3696 Median :10.15 Median :12.85 Median :35.56 Median :16.37 Median :23.61 Median :12.17 Median :80.74 Median : 8.887 Median : 0.5721 Median : 1.2320 Median : 4.783 Median :23.728 Median :49.70 Median :78.32 Median : 6.473 Median :0.15132 Median : 5.562 Median :10.5435
NA NA Mean :49.64 Mean :82.99 Mean :74.00 Mean : 64355 Mean : 8860 Mean :31192 Mean :3785 Mean :10.31 Mean :13.02 Mean :36.57 Mean :16.48 Mean :23.26 Mean :12.82 Mean :79.62 Mean : 9.139 Mean : 1.8255 Mean : 1.4183 Mean : 5.090 Mean :24.129 Mean :49.77 Mean :77.64 Mean : 7.161 Mean :0.20117 Mean : 5.589 Mean :15.0869
NA NA 3rd Qu.:50.13 3rd Qu.:93.61 3rd Qu.:76.86 3rd Qu.: 71308 3rd Qu.:10435 3rd Qu.:33528 3rd Qu.:4303 3rd Qu.:12.01 3rd Qu.:16.18 3rd Qu.:39.94 3rd Qu.:17.98 3rd Qu.:25.12 3rd Qu.:15.81 3rd Qu.:83.50 3rd Qu.:10.107 3rd Qu.: 1.7033 3rd Qu.: 1.7044 3rd Qu.: 6.121 3rd Qu.:27.805 3rd Qu.:51.96 3rd Qu.:81.92 3rd Qu.: 8.650 3rd Qu.:0.23676 3rd Qu.: 6.963 3rd Qu.:20.2654
NA NA Max. :55.51 Max. :99.17 Max. :86.43 Max. :128268 Max. :23381 Max. :51353 Max. :9699 Max. :23.49 Max. :27.62 Max. :60.07 Max. :26.05 Max. :31.20 Max. :33.17 Max. :91.30 Max. :22.400 Max. :51.5809 Max. :10.2377 Max. :12.895 Max. :42.770 Max. :62.79 Max. :87.94 Max. :21.500 Max. :1.85733 Max. :11.235 Max. :83.3101

Hierarchical clustering using census.ct seems like a better approach. From PCA analysis, we already know that the three features that play important role for county level data are: IncomePerCap, ChildPoverty and Poverty. Using census.ct data, San Mateo belongs in cluster 7. Using ct.pc data, San Mateo belongs to cluster 4. We then compared the mean of IncomePerCap, ChildPoverty and Poverty within each cluster with the actual IncomePerCap, ChildPoverty and Poverty rate of San Mateo. The samples in cluster 7 using census.ct.data has closer means to San Mateo than the samples in cluster 4 using ct.pc data. The same for other variables, samples from custer 7 using census.ct seem to have a closer mean to the actual observation. So cluster 7 using census.ct might be a better fit for San Mateo. This could be due to important cluster separation might sometimes take place in dimensions with weak variance, therefore dimension reduction might not always be better before clustering.

Classification

In order to train classification models, we need to combine county_winner and census.ct data. This seemingly straightforward task is harder than it sounds. Following code makes necessary changes to merge them into election.cl for classification.

tmpwinner <- county_winner %>% ungroup %>%
  mutate(state = state.name[match(state, state.abb)]) %>%               ## state abbreviations
  mutate_at(vars(state, county), tolower) %>%                           ## to all lowercase
  mutate(county = gsub(" county| columbia| city| parish", "", county))  ## remove suffixes
tmpcensus <- census.ct %>% ungroup %>% 
  mutate_at(vars(State, County), tolower)

election.cl <- tmpwinner %>%
  left_join(tmpcensus, by = c("state"="State", "county"="County")) %>% 
  na.omit

## save meta information
election.meta <- election.cl %>% select(c(county, fips, state, votes, pct, total))

## save predictors and class labels
election.cl = election.cl %>% select(-c(county, fips, state, votes, pct, total))

Using the following code, partition data into 80% training and 20% testing:

set.seed(10) 
n <- nrow(election.cl)
in.trn <- sample.int(n, 0.8*n) 
trn.cl <- election.cl[ in.trn,]
tst.cl <- election.cl[-in.trn,]

Using the following code, define 10 cross-validation folds:

set.seed(20) 
nfold <- 10
folds <- sample(cut(1:nrow(trn.cl), breaks=nfold, labels=FALSE))

Using the following error rate function:

calc_error_rate = function(predicted.value, true.value){
  return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=5, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso", "random forest", "knn")

Classification

16. Decision tree: train a decision tree by cv.tree(). Prune tree to minimize misclassification error. Be sure to use the folds from above for cross-validation. Visualize the trees before and after pruning. Save training and test errors to records variable. Intepret and discuss the results of the decision tree analysis. Use this plot to tell a story about voting behavior in the US (remember the NYT infographic?)

# Fit model on training set
tree.election <- tree(candidate ~., data = trn.cl) 

# Plot the tree
plot(tree.election)
text(tree.election, pretty = 0, cex = .5, col = "red")
title("Classification Tree Before Pruning")

# Set random seed
set.seed(69) 

# K-Fold cross validation
cv <- cv.tree(tree.election, FUN = prune.misclass, folds)
# Print out cv
# cv

# Best size
best.cv <- cv$size[which(cv$dev == min(cv$dev))]
best.size.cv <- min(best.cv)

# Prune tree.election
tree.election.pruned <- prune.misclass (tree.election, best = best.size.cv)

# Plot the tree
plot(tree.election.pruned)
text(tree.election.pruned, pretty = 0, cex = .6, col = "red")
title("Classification Tree After Pruning")

# Predict on train set and calculate train error
pred.tree.train <- predict(tree.election.pruned, trn.cl, type="class") 
tree_train_err <- calc_error_rate(pred.tree.train, trn.cl$candidate)

# Predict on test set and calculate test error
pred.tree.test <- predict(tree.election.pruned, tst.cl, type="class")
tree_test_err <- calc_error_rate(pred.tree.test, tst.cl$candidate)

records[1, 1] <- tree_train_err
records[1, 2] <- tree_test_err

kable(records, caption = "Records", "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
Records
train.error test.error
tree 0.0749186 0.0830619
logistic NA NA
lasso NA NA
random forest NA NA
knn NA NA

The decision tree model gave us a 0.074 error rate on training dataset and slightly higher error rate of 0.083 testing datset. Overall, it did a good job. From the classification tree plot after pruning, we can see that counties that rely on public transiportation more tend to vote for Hillary. This could be due to areas where people rely on public transit more are usually the urban areas. In addition, the mojority of white demographic majority areas voted for Trump. For regions that white population less than 48 percent, the areas where income are low will vote for Hillary. And areas where unemployment rates are high would vote for Hillary as well. This is similar to what we have known: Demoncratics usually do better among minority and among poverty areas.

17. Run a logistic regression to predict the winning candidate in each county. Save training and test errors to records variable. What are the significant variables? Are the consistent with what you saw in decision tree analysis? Interpret the meaning of a couple of the significant coefficients in terms of a unit change in the variables.

glm.fit <- glm(candidate ~.,
             data = trn.cl, family=binomial)
# Summarize the logistic regression model
kable(glm.fit$coefficients, caption = "logistic coefficient"%>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
logistic coefficient
x
(Intercept) -11.5774984
Men 0.0858465
White -0.2216640
Citizen 0.1018512
Income -0.0000758
IncomeErr -0.0000388
IncomePerCap 0.0002676
IncomePerCapErr -0.0002785
Poverty 0.0199018
ChildPoverty -0.0074088
Professional 0.2816889
Service 0.3658750
Office 0.1051116
Production 0.1844630
Drive -0.2562350
Carpool -0.2462987
Transit -0.0086047
OtherTransp -0.0969895
WorkAtHome -0.2100919
MeanCommute 0.0632558
Employed 0.1652801
PrivateWork 0.0925337
SelfEmployed 0.0122624
FamilyWork -1.1913135
Unemployment 0.1837887
Minority -0.0891958

In above results, White, Citizen, IncomePerCap, Professional, Service, Production, Drive, Carpool, Employed, PrivateWork and Unemployment are statistically highly significant at level 0.001. The logistic regression coefficients, if logit link function is used, give the change in the log odds of the outcome for a one unit increase in a predictor variable, while others being held constant. For example:

  • The variable White has a coefficient -0.2217. For every one unit increase in White, the log odds of voting for Hillary (versus voting for Trump) decreases by 0.2217, holding other variables fixed. This is consistent with what we established in decision tree model: White population likes to vote for Trump.

  • The variable Unemployment has a coefficient 0.1838. For a one unit increase in Unemloyment, the log odds of voting for Hillary (versus voting for Trump) increases by 0.1838, holding other variables fixed. This is also consistent with what we seen in decision tree model: Unemployed population likes to vote for Hilliary.

Some other variables like IncomePerCap are also consistant with the conclusions we got from decision tree: Hilliary are more popular than Trump in poverty area. However, in decision tree, we use Transit as an important feature for classification. But in our logistic model, while holding other variables fixed, transit does not have a statistically significant value.

# predict training and testing datasets
prob.training <- predict(glm.fit, newdata = trn.cl, type="response")
prob.training <- round(prob.training, digits=2)

prob.test <- predict(glm.fit, newdata = tst.cl, type="response")
prob.test <- round(prob.test, digits=2)
# Save the predicted labels using 0.5 as a threshold
pred.logistic.training <- as.factor(ifelse(prob.training<=0.5, "Donald Trump", "Hillary Clinton"))
pred.logistic.test <- as.factor(ifelse(prob.test<=0.5, "Donald Trump", "Hillary Clinton"))

# training error 
logistic_training_error <- 
  calc_error_rate(pred.logistic.training, droplevels(trn.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
# testing error 
logistic_testing_error <- 
  calc_error_rate(pred.logistic.test, droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))

# pass in to records 
records[2, 1] <- logistic_training_error
records[2, 2] <- logistic_testing_error

kable(records, caption = "Records", "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
Records
train.error test.error
tree 0.0749186 0.0830619
logistic 0.0671824 0.0749186
lasso NA NA
random forest NA NA
knn NA NA

18. You may notice that you get a warning glm.fit: fitted probabilities numerically 0 or 1 occurred. As we discussed in class, this is an indication that we have perfect separation (some linear combination of variables perfectly predicts the winner). This is usually a sign that we are overfitting. One way to control overfitting in logistic regression is through regularization. Use the cv.glmnet function from the glmnet library to run K-fold cross validation and select the best regularization parameter for the logistic regression with LASSO penalty. Reminder: set alpha=1 to run LASSO regression, set lambda = c(1, 5, 10, 50) * 1e-4 in cv.glmnet() function to set pre-defined candidate values for the tuning parameter \(\lambda\). This is because the default candidate values of \(\lambda\) in cv.glmnet() is relatively too large for our dataset thus we use pre-defined candidate values. What is the optimal value of \(\lambda\) in cross validation? What are the non-zero coefficients in the LASSO regression for the optimal value of \(\lambda\)? How do they compare to the unpenalized logistic regression? Save training and test errors to the records variable.

set.seed(1)

cv.out.lasso <- cv.glmnet(trn.cl %>% 
                            select(-candidate) %>% 
                            as.matrix, trn.cl$candidate %>% 
                            droplevels(except = c("Donald Trump", "Hillary Clinton")), alpha = 1, lambda = c(1, 5, 10, 50) * 1e-4, nfolds = 10, family="binomial")
bestlam <- cv.out.lasso$lambda.min
# bestlam 

lasso.mod <- glmnet(trn.cl %>% 
                            select(-candidate) %>% 
                            as.matrix, trn.cl$candidate %>% 
                            droplevels(except = c("Donald Trump", "Hillary Clinton")), alpha = 1, lambda = bestlam, family="binomial")

# predict training and testing datasets
lasso.pred.training <- predict(lasso.mod, newx = trn.cl %>% select(-candidate) %>% as.matrix, type = "class") 
lasso.pred.testing <- predict(lasso.mod, newx = tst.cl %>% select(-candidate) %>% as.matrix, type = "class") 

# training error 
lasso_training_error <- 
  calc_error_rate(lasso.pred.training, droplevels(trn.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
# testing error 
lasso_testing_error <- 
  calc_error_rate(lasso.pred.testing, droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))

# pass in to records 
records[3, 1] <- lasso_training_error
records[3, 2] <- lasso_testing_error
kable(records, caption = "Records", "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) 
Records
train.error test.error
tree 0.0749186 0.0830619
logistic 0.0671824 0.0749186
lasso 0.0663681 0.0781759
random forest NA NA
knn NA NA
lasso.coef <- predict(lasso.mod, type="coefficients")[1:26,]
kable(lasso.coef, caption = "lasso coefficient"%>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
lasso coefficient
x
(Intercept) -20.4054760
Men 0.0491127
White -0.1238268
Citizen 0.1111130
Income -0.0000465
IncomeErr -0.0000466
IncomePerCap 0.0001982
IncomePerCapErr -0.0002012
Poverty 0.0149537
ChildPoverty 0.0000000
Professional 0.2553687
Service 0.3352589
Office 0.0779562
Production 0.1497450
Drive -0.2083447
Carpool -0.1936119
Transit 0.0417913
OtherTransp -0.0393694
WorkAtHome -0.1516355
MeanCommute 0.0435754
Employed 0.1563779
PrivateWork 0.0848988
SelfEmployed 0.0000000
FamilyWork -1.0274680
Unemployment 0.1746343
Minority 0.0000000

The optimal value of \(\lambda\) in cross validation is 5e-04. The non-zero coefficients in the LASSO regression are every variable except for SelfEmployed, ChildPoverty and Minority. The absolute value of most of the values in the lasso coeffient is half of those in logistic regression.

19. Compute ROC curves for the decision tree, logistic regression and LASSO logistic regression using predictions on the test data. Display them on the same plot. Based on your classification results, discuss the pros and cons of the various methods. Are the different classifiers more appropriate for answering different kinds of questions about the election?

# Find the testing probabilities
prob.tree.test <- predict(tree.election.pruned, tst.cl, type="vector")
prob.logistic.test <- predict(glm.fit, tst.cl, type="response")
prob.lasso.test <- predict(lasso.mod, newx = tst.cl %>% select(-candidate) %>% as.matrix, type = "response")

# First arument is the predicted probabilities, second is true labels
pred_tree <- prediction(prob.tree.test[, 13], droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
pred_logistic <- prediction(prob.logistic.test, droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
pred_lasso <- prediction(prob.lasso.test, droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))

# Secondly, we calculate the True Positive Rate and False Positive Rate by `performance()`.
# We want TPR on the y axis and FPR on the x axis
perf_tree <- performance(pred_tree, measure="tpr", x.measure="fpr")
perf_logistic <- performance(pred_logistic, measure="tpr", x.measure="fpr")
perf_lasso <- performance(pred_lasso, measure="tpr", x.measure="fpr")

# Plot the ROC curve
plot(perf_tree, col=2, lwd=2, main="ROC curve")
abline(0,1)
par(new=TRUE)
plot(perf_logistic, col=6, lwd=2)
par(new=TRUE)
plot(perf_lasso, col=3, lwd=2)

# Calculate AUC
auc.tree <- performance(pred_tree, "auc")@y.values
auc.logistic <- performance(pred_logistic, "auc")@y.values
auc.lasso <- performance(pred_lasso, "auc")@y.values
# auc.tree
# auc.logistic
# auc.lasso

From the ROC curve, we get the lasso prediction AUC is 0.9657537, which is larger than the AUC of tree prediction and logistic prediction, which is 0.917 and 0.96 respectively. Usually, an AUC larger than 0.95 is generally a good classification. This tells us lasso and logistic are better classfication than decision tree in this case, and by using regularization for the logistic model with the lasso penatly, we decrease overfitting on the training dataset, therefore has better prediction results on the testing dataset. The decision tree method gives us a clear visualization about the tree splits based on important factors, but lasso model and logistic model provide better analysis of each variable’s importance while holding other variables fixed. This answers the question of what are some of the most important factors affecting voters’ choice while holding all other variables constant.

Taking it further

20. This is an open question. Interpret and discuss any overall insights gained in this analysis and possible explanations. Use any tools at your disposal to make your case: visualize errors on the map, discuss what does/doesn’t seems reasonable based on your understanding of these methods, propose possible directions (collecting additional data, domain knowledge, etc). In addition, propose and tackle at least one more interesting question. Creative and thoughtful analyses will be rewarded! This part will be worth up to a 20% of your final project grade!

Visualize errors on the map

We combine ‘county’ and ‘tmpcensus’into ’election_map’ using ‘left_join’, and then use our previous decision tree model and logistic regression model to predict on ‘election_map’. ‘Election_map’ has all the observations from ‘county’, and contain all the necessary information to draw a U.S map.

county_map <- county %>% ungroup %>%
  mutate(state = state.name[match(state, state.abb)]) %>%               ## state abbreviations
  mutate_at(vars(state, county), tolower) %>%                           ## to all lowercase
  mutate(county = gsub(" county| columbia| city| parish", "", county))  ## remove suffixes
election_map <- county_map %>%
  left_join(tmpcensus, by = c("state"="State", "county"="County")) %>% 
  na.omit

# Decision tree model error 
pred.tree.map <- predict(tree.election.pruned, election_map, type="class")
error.map <- election_map %>% mutate(prediction = ifelse(election_map$candidate != pred.tree.map, "incorrect", "correct"))
ggplot(data = error.map) + 
  geom_polygon(aes(x = long, y = lat, fill = prediction, group = group), color = "white") + 
  coord_fixed(1.3) + 
  ggtitle("Predition outcome for decision tree")

# Logistic model error 
prob.logistic.map <- predict(glm.fit, election_map, type="response") 
pred.logistic.map <- as.factor(ifelse(prob.logistic.map<=0.5, "Donald Trump", "Hillary Clinton"))
error.map <- election_map %>% mutate(prediction = ifelse(droplevels(election_map$candidate, except = c("Donald Trump", "Hillary Clinton")) != pred.logistic.map, "incorrect", "correct"))
ggplot(data = error.map) + 
  geom_polygon(aes(x = long, y = lat, fill = prediction, group = group), color = "white") + 
  coord_fixed(1.3) + 
  ggtitle("Predition outcome for logistic regression")

As we can see from above, logistic model has less incorrect prediction than decision tree model. Also worth noticing that most of the prediction improvements are from the west side of the countries.

Questions and concerns

  • One thing that seems odd to me is the Transit variable in decision tree classification. We later found out that Income/IncomePerCap, Unemplyment and White variables used by decision tree model are also significantly important in other classification methods. Transit, however, does not show a level a significance when we use logistic regression or lasso regression. One possible explanation for this could be although Transit as a single variable does not have meaningful impart on predicting voters’ preference, it has strong correlation with other variables. Therefore, we can also further explore this relationship by looking at the interaction terms in regression model. This will allow us to test more hypothesis about the effect of Transit for different values of other variables and therefore reach more accurate prediction results.

  • In PCA, IncomePerCap, ChildPoverty and Poverty are the three most important factors in the first principal component. However, ChildPoverty and Minority have coefficients of 0 in our Lasso regression. This might look contradictary at first, but then again this could be due to the ChildPoverty is higly related to, say Poverty. And we have only looked at the variables separately while holding all others constant in the regression models. Also even though we suspect that minority has a significant impact on voters’ behavior, this variable could also highly correlated with variables like White. Different values of White can mean different impact of Minority on the regression model.

  • ChildPoverty and Minority have negative relationships with voting for Hillary. Although not significant, this seem strange to me because we thought larger ChildPoverty rate or Minority rate means more support for Hillary. In general, we can interpret and explain the results of our models better if we have clearer understanding of the U.S politics and/or past voting history. Perhaps we can gether more information from the past election dataset, and then build more models to analyze different factors affecting different voting behavior.

Random forest

trn.cl$candidate <- droplevels(trn.cl$candidate, except = c("Donald Trump", "Hillary Clinton"))
rf.election <- randomForest(candidate ~ ., data=trn.cl, ntree=500, importance=TRUE)
# rf.election
plot(rf.election)

kable(importance(rf.election)%>% head, "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
Donald Trump Hillary Clinton MeanDecreaseAccuracy MeanDecreaseGini
Men 8.5451155 11.707915 13.94213 16.55625
White 25.4968974 35.972362 36.08114 84.35682
Citizen 12.3470046 5.343751 13.81806 15.98468
Income 12.0782016 15.205013 19.20286 21.34947
IncomeErr 0.3424663 12.007279 11.33028 14.97626
IncomePerCap 13.1941166 15.353247 19.30548 24.62288
varImpPlot(rf.election, sort=T, main="Variable Importance for Predicting Election", n.var=5)

pred.rf.train <- predict (rf.election, newdata = trn.cl)
pred.rf.test <- predict (rf.election, newdata = tst.cl)

# training error 
rf_training_error <- 
  calc_error_rate(pred.rf.train, droplevels(trn.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
# testing error 
rf_testing_error <- 
  calc_error_rate(pred.rf.test, droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))

# pass in to records 
records[4, 1] <- rf_training_error
records[4, 2] <- rf_testing_error

kable(records, caption = "Records", "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
Records
train.error test.error
tree 0.0749186 0.0830619
logistic 0.0671824 0.0749186
lasso 0.0663681 0.0781759
random forest 0.0012215 0.0456026
knn NA NA

We see that top 5 most impartant factors are Transit, White, Minority Professional and Unemployment. This is pretty consistent with what we have for other classification models. We have greatly improve our classification model by reducing the error rate of training and testing to 0.001222 and 0.0439, respectively.

KNN

We also would like to try the K nearst neighbor method for classification.

# YTrain is the true labels for High on the training set, XTrain is the design matrix
YTrain = trn.cl$candidate
XTrain = trn.cl %>% select(-candidate)
# YTest is the true labels for High on the test set, Xtest is the design matrix
YTest = tst.cl$candidate
XTest = tst.cl %>% select(-candidate)

# LOOCV 
validation.error = NULL
allK = 1:50
set.seed(66)
# For each number in allK, use LOOCV to find a validation error  
for (i in allK){  # Loop through different number of neighbors
    pred.Yval = knn.cv(train=XTrain, cl=YTrain, k=i) # Predict on the left-out validation set
    validation.error = c(validation.error, mean(pred.Yval!=YTrain)) # Combine all validation errors
}
numneighbor = max(allK[validation.error == min(validation.error)])
# numneighbor

# knn on the train set
pred.YTtrain = knn(train=XTrain, test=XTrain, cl=YTrain, k=numneighbor)
conf.train = table(predicted=pred.YTtrain, true=YTrain)
knn.train.err <- 1 - sum(diag(conf.train)/sum(conf.train))

# knn on test set
pred.YTest = knn(train=XTrain, test=XTest, cl=YTrain, k=numneighbor)
conf.test = table(predicted=pred.YTest, true=YTest)
knn.test.err <- 1 - sum(diag(conf.test)/sum(conf.test))

records[5, 1] <- knn.train.err
records[5, 2] <- knn.test.err
kable(records, caption = "Records", "html")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
Records
train.error test.error
tree 0.0749186 0.0830619
logistic 0.0671824 0.0749186
lasso 0.0663681 0.0781759
random forest 0.0012215 0.0456026
knn 0.1307003 1.0000000

Using LOOCV method, the best number of neighbors is 27. Training our election data on KNN, we get a training error rate of 0.13 and a testing error rate of 0.15. Obviously, this is not a very good classification model for our election dataset.